The aim of this guide is to make you familiar with mice in R and to enhance your understanding of multiple imputation, in general. You will learn how to perform and inspect multiple imputations and how to pool the results of analyses performed on multiply-imputed data, how to approach different types of data and how to avoid some of the pitfalls many data scientists may fall into. The main objective is to increase your knowledge and understanding on applications of multiple imputation.

We start by loading (with library()) the necessary packages and fixing the random seed to allow for our outcomes to be replicable.

library(mice)    # data imputation
library(lattice) # plotting device
library(dplyr)   # data manipulation
library(magrittr)# pipes
library(purrr)   # functional programming

All the best,

Gerko and Stef


Multiple imputation with mice


We fix the RNG seed to allow for replication of the below results.

set.seed(123)

The mice package contains several datasets. Once the package is loaded, these datasets can be used. Have a look at the nhanes dataset (Schafer, 1997, Table 6.14) by typing

nhanes
##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

The nhanes dataset is a small data set with non-monotone missing values. It contains 25 observations on four variables: age group, body mass index, hypertension and cholesterol (mg/dL).

To learn more about the data, use one of the two following help commands:

help(nhanes)
?nhanes

1. Vary the number of imputations, such that the nhanes set is imputed \(m=3\) times.

The number of imputed data sets can be specified by the m = ... argument. For example, to create just three imputed data sets, specify

imp <- mice(nhanes, m = 3, print=FALSE)

The print = FALSE argument omits printing of the iteration history from the output. The main reason to omit printing here is to save space in the document.


2. Change the predictor matrix

The predictor matrix is a square matrix that specifies the variables that can be used to impute each incomplete variable. Let us have a look at the predictor matrix that was used

imp$pred
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

Each variable in the data has a row and a column in the predictor matrix. A value 1 indicates that the column variable was used to impute the row variable. For example, the 1 at entry [bmi, age] indicates that variable age was used to impute the incomplete variable bmi. Note that the diagonal is zero because a variable is not allowed to impute itself. The row of age is redundant, because there were no missing values in age. Even though predictor relations are specified for age, mice will not use these relations because it will never overwrite the observed values with imputations. mice gives you complete control over the predictor matrix, enabling you to choose your own predictor relations. This can be very useful, for example, when you have many variables or when you have clear ideas or prior knowledge about relations in the data at hand.

There are two ways in which you can create a predictor matrix in mice:

  • A. You can grab the predictorMatrix from every object returned by mice():

For example, we can use any mice() object fitted to the data to grab the predictorMatrix or we can use mice to quickly initialize a predictor matrix, and change it afterwards, without running the algorithm. This latter approach can be done by setting the maximum number of iterations to maxit=0. This leaves the algorithm at initialization, but generates the necessary inner objects.

ini <- mice(nhanes, maxit=0, print=FALSE)
pred <- ini$pred
pred
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

The object pred contains the predictor matrix from an initial run of mice with zero iterations. It is a square matrix that captures the information about which variables (in the rows) are imputed based on which predictors (in the columns).

  • B. We can use make.predictorMatrix() to generate a predictor matrix from any incomplete data set.

For example,

pred <- make.predictorMatrix(nhanes)
pred
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

Altering the predictor matrix and returning it to the mice algorithm is very simple. For example, the following code removes the variable hyp from the set of predictors, but still leaves it to be predicted by the other variables.

pred[, "hyp"] <- 0
pred
##     age bmi hyp chl
## age   0   1   0   1
## bmi   1   0   0   1
## hyp   1   1   0   1
## chl   1   1   0   0

Use your new predictor matrix in mice() as follows

imp <- mice(nhanes, predictorMatrix =  pred, print = F)

As you can see, we can easily feed the new matrix pred to mice. We can also abbreviate the logical operater in the argument print=FALSE.

There is a quickpred() function that applies a quick selection procedure of predictors, which can be handy for datasets containing many variables. See ?quickpred for more info. Selecting predictors according to data relations with a minimum correlation of \(\rho=.30\) can be done by

ini <- mice(nhanes, pred=quickpred(nhanes, mincor=.3), print=F)
ini$pred
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   0   0   1
## chl   1   1   1   0

For large predictor matrices, it can be useful to export them to dedicated spreadsheet software like e.g. Microsoft Excel for easier configuration (e.g. see the xlsx package for easy exporting and importing of Excel files). Importing data is straightforward in RStudio through File > Import Dataset.


3. Inspect the convergence of the algorithm

The mice() function implements an iterative Markov Chain Monte Carlo type of algorithm. Let us have a look at the trace lines generated by the algorithm to study convergence:

imp <- mice(nhanes, print=F)
plot(imp)

The plot shows the mean (left) and standard deviation (right) of the imputed values only. In general, we would like the streams to intermingle (mixing) and be free of any trends at the later iterations (non-stationary). We inspect trends for the imputed values alone, because the observed data does not change. In our case we cannot speak of convergence, especially not for bmi. More iterations or a different model are needed.

The mice algorithm uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations with different seeds. In order to get identical mice objects between calls, we can fix the use the seed argument.

imp <- mice(nhanes, seed=123, print=F)

where 123 is some arbitrary number that you can choose yourself. Rerunning this command will always yields the same imputed values.


4. Change the imputation method

For each column, the algorithm requires a specification of the imputation method. To see which method was used by default:

imp$meth
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm"

The variable age is complete and therefore not imputed, denoted by the "" empty string. The other variables have method pmm, which stands for predictive mean matching, the default in mice for numerical and integer data.

In reality, the nhanes data are better described a as mix of numerical and categorical data. Let us take a look at the nhanes2 data frame:

summary(nhanes2)
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10

and the structure of the data frame

str(nhanes2)
## 'data.frame':    25 obs. of  4 variables:
##  $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num  NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num  NA 187 187 NA 113 184 118 187 238 NA ...

Variable age consists of 3 age categories, while variable hyp is binary. The mice() function takes these properties automatically into account. Impute the nhanes2 dataset

imp <- mice(nhanes2, print=F)
imp$meth
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"

Notice that mice has set the imputation method for variable hyp to logreg, which implements multiple imputation by logistic regression.

An up-to-date overview of the methods in mice can be found by

methods(mice)
##  [1] mice.impute.2l.bin              mice.impute.2l.lmer            
##  [3] mice.impute.2l.norm             mice.impute.2l.pan             
##  [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
##  [7] mice.impute.2lonly.pmm          mice.impute.cart               
##  [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
## [11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
## [13] mice.impute.lasso.select.norm   mice.impute.lda                
## [15] mice.impute.logreg              mice.impute.logreg.boot        
## [17] mice.impute.mean                mice.impute.midastouch         
## [19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
## [21] mice.impute.norm                mice.impute.norm.boot          
## [23] mice.impute.norm.nob            mice.impute.norm.predict       
## [25] mice.impute.panImpute           mice.impute.passive            
## [27] mice.impute.pmm                 mice.impute.polr               
## [29] mice.impute.polyreg             mice.impute.quadratic          
## [31] mice.impute.rf                  mice.impute.ri                 
## [33] mice.impute.sample              mice.mids                      
## [35] mice.theme                     
## see '?methods' for accessing help and source code

Let us change the imputation method for bmi to Bayesian normal linear regression imputation

meth <- make.method(nhanes2)
meth
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm"
meth["bmi"] <- "norm"
meth
##      age      bmi      hyp      chl 
##       ""   "norm" "logreg"    "pmm"

and run the imputations again.

imp <- mice(nhanes2, meth = meth, print=F)

We may now again plot trace lines to study convergence

plot(imp)


5. Extend the number of iterations

Though using just five iterations (the default) often works well in practice, we need to extend the number of iterations of the mice algorithm to confirm that there is no trend and that the trace lines intermingle well. We can increase the number of iterations to 40 by running 35 additional iterations using the mice.mids() function.

imp40 <- mice.mids(imp, maxit=35, print=F)
plot(imp40)


6. Further diagnostic checking. Use function stripplot().

Generally, one would prefer for the imputed data to be plausible values, i.e. values that could have been observed if they had not been missing. In order to form an idea about plausibility, one may check the imputations and compare them against the observed values. If we are willing to assume that the data are missing completely at random (MCAR), then the imputations should have the same distribution as the observed data. In general, distributions may be different because the missing data are MAR (or even MNAR). However, very large discrepancies need to be screened. Let us plot the observed and imputed data of chl by

stripplot(imp, chl~.imp, pch=20, cex=2)

The convention is to plot observed data in blue and the imputed data in red. The figure graphs the data values of chl before and after imputation. Since the PMM method draws imputations from the observed data, imputed values have the same gaps as in the observed data, and are always within the range of the observed data. The figure indicates that the distributions of the imputed and the observed values are similar. The observed data have a particular feature that, for some reason, thedata cluster around the value of 187. The imputations reflect this feature, and are close to the data. Under MCAR, univariate distributions of the observed and imputed data are expected to be identical. Under MAR, they can be different, both in location and spread, but their multivariate distribution is assumed to be identical. There are many other ways to look at the imputed data.

The following command creates a simpler version of the graph from the previous step and adds the plot for bmi.

stripplot(imp)

Remember that bmi was imputed by Bayesian linear regression and (the range of) imputed values may therefore be different than observed values.


Repeated analysis in mice #1


7. Perform the following regression analysis on the multiply imputed data and assign the result to object fit.

\[ \text{bmi} = \beta_0 + \beta_1 \text{chl} + \epsilon \]

Let’s run the above model on the imputed data set.

fit <- with(imp, lm(bmi ~ chl))
fit
## call :
## with.mids(data = imp, expr = lm(bmi ~ chl))
## 
## call1 :
## mice(data = nhanes2, method = meth, printFlag = F)
## 
## nmis :
## age bmi hyp chl 
##   0   9   8  10 
## 
## analyses :
## [[1]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    23.32599      0.02282  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    21.92498      0.02085  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##     20.2116       0.0344  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    20.15665      0.03852  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Coefficients:
## (Intercept)          chl  
##    24.80765      0.01157

The fit object contains the regression summaries for each data set. The new object fit is actually of class mira (multiply imputed repeated analyses).

class(fit)
## [1] "mira"   "matrix"

Use the ls() function to what out what is in the object.

ls(fit)
## [1] "analyses" "call"     "call1"    "nmis"

Suppose we want to find the regression model fitted to the second imputed data set. It can be found as

summary(fit$analyses[[2]])
## 
## Call:
## lm(formula = bmi ~ chl)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6386 -3.1231 -0.2381  2.8443  8.8307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.92498    3.95950   5.537 1.24e-05 ***
## chl          0.02085    0.01970   1.058    0.301    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.264 on 23 degrees of freedom
## Multiple R-squared:  0.04642,    Adjusted R-squared:  0.004965 
## F-statistic:  1.12 on 1 and 23 DF,  p-value: 0.301

8. Pool the analyses from object fit.

Pooling the repeated regression analyses can be done simply by typing

pool.fit <- pool(fit)
summary(pool.fit)
##          term    estimate  std.error statistic       df      p.value
## 1 (Intercept) 22.08537216 4.58683836  4.814945 13.40323 0.0003100903
## 2         chl  0.02563088 0.02338968  1.095820 12.46079 0.2938853567
pool.fit
## Class: mipo    m = 5 
##          term m    estimate         ubar            b            t dfcom
## 1 (Intercept) 5 22.08537216 16.177078691 4.0516728978 2.103909e+01    23
## 2         chl 5  0.02563088  0.000405596 0.0001179011 5.470773e-04    23
##         df       riv   lambda       fmi
## 1 13.40323 0.3005492 0.231094 0.3248446
## 2 12.46079 0.3488233 0.258613 0.3545184

which gives the relevant pooled regression coefficients and parameters, as well as the fraction of information about the coefficients missing due to nonresponse (fmi) and the proportion of the variation attributable to the missing data (lambda). The pooled fit object is of class mipo, which stands for multiply imputed pooled object.

Alternatively, we could use a functional programming pipe to achieve the same

fit <- imp %>% 
  complete("all") %>% # list where each listed element is a completed set
  map(lm, formula = bmi ~ chl) %>% 
  pool() %>% 
  summary()

Have a look at the different workflows that can be adopted with mice in this chapter in Van Buuren’s book.

mice can to pool many analyses from a variety of packages for you (it uses broom to gather all parameters). For flexibility and in order to run custom pooling functions, mice also incorporates a function pool.scalar() which pools univariate estimates of \(m\) repeated complete data analysis conform Rubin’s pooling rules (Rubin, 1987, paragraph 3.1)


The boys data set


9. The boys dataset is part of mice. It is a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study. Inspect the help for boys dataset and make yourself familiar with its contents.

To learn more about the contents of the data, use one of the two following help commands:

help(boys)
?boys

10. Get an overview of the data. Find information about the size of the data, the variables measured and the amount of missingness.

The first 10 cases are:

head(boys, n = 10)
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 37 0.068 52.5 3.810 13.82 34.9 <NA> <NA> NA south
## 38 0.071 53.0 3.890 13.84 35.8 <NA> <NA> NA  west
## 39 0.071 55.1 3.880 12.77 36.8 <NA> <NA> NA  west
## 43 0.073 54.5 4.200 14.14 38.0 <NA> <NA> NA  east

The last 10 cases are:

tail(boys, n = 10)
##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7329 20.032 184.0 73.0 21.56 56.0 <NA> <NA> NA north
## 7362 20.117 188.7 89.4 25.10 58.1   G5   P6 25  east
## 7396 20.281 185.1 81.1 23.67 58.8   G5   P6 20 south
## 7405 20.323 182.5 69.0 20.71 59.0 <NA> <NA> NA north
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14   NA <NA> <NA> NA  east

We now have a clear indication that the data are sorted. A simple evaluation

!is.unsorted(boys$age)
## [1] TRUE

confirms this - !is.unsorted() evaluates the complement of is.unsorted(), so it tests whether the data are sorted. There is no is.sorted function in R.

The dimensions of the boys data set are:

dim(boys)
## [1] 748   9

We see that the boys data set has 748 cases over 9 variables. From those 9 variables

summary(boys)
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

function summary() informs us that testicular volume tv has the most missings, followed by the genital and pubic hair stages gen and phb, each with 503 missing cells.


11. As we have seen before, the function md.pattern() can be used to display all different missing data patterns. How many different missing data patterns are present in the boys dataframe and which pattern occurs most frequently in the data?

md.pattern(boys)

##     age reg wgt hgt bmi hc gen phb  tv     
## 223   1   1   1   1   1  1   1   1   1    0
## 19    1   1   1   1   1  1   1   1   0    1
## 1     1   1   1   1   1  1   1   0   1    1
## 1     1   1   1   1   1  1   0   1   0    2
## 437   1   1   1   1   1  1   0   0   0    3
## 43    1   1   1   1   1  0   0   0   0    4
## 16    1   1   1   0   0  1   0   0   0    5
## 1     1   1   1   0   0  0   0   0   0    6
## 1     1   1   0   1   0  1   0   0   0    5
## 1     1   1   0   0   0  1   1   1   1    3
## 1     1   1   0   0   0  0   1   1   1    4
## 1     1   1   0   0   0  0   0   0   0    7
## 3     1   0   1   1   1  1   0   0   0    4
##       0   3   4  20  21 46 503 503 522 1622

There are 13 patterns in total, with the pattern where gen, phb and tv are missing occuring the most.


12. How many patterns occur for which the variable gen (genital Tannerstage) is missing?

mpat <- md.pattern(boys, plot = FALSE)
sum(mpat[, "gen"] == 0)
## [1] 8

Answer: 8 patterns (503 cases)


13. Let us focus more precisely on the missing data patterns. Does the missing data of gen depend on age? One could for example check this by making a histogram of age separately for the cases with known genital stages and for cases with missing genital stages.

To create said histogram in R, a missingness indicator for gen has to be created. A missingness indicator is a dummy variable with value 1 for observed values (in this case genital status) and 0 for missing values. Create a missingness indicator for gen by typing

R <- is.na(boys$gen) 
head(R, n = 100)
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
tail(R, n = 100)
##   [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
##  [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
##  [25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [37]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
##  [49]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [61] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
##  [73] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
##  [85] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE
length(R)
## [1] 748

As we can see, the missingness indicator tells us for each of the 748 values in gen whether it is missing (TRUE) or observed (FALSE).

A histogram can be made with the function lattice::histogram().

lattice::histogram(boys$gen)

or, equivalently, one could use

lattice::histogram(~ gen, data = boys)

Writing the latter line of code for plots is more efficient than selecting every part of the boys data with the boys$... command, especially if plots become more advanced. The code for a conditional histogram of age given R is

lattice::histogram(~ age | R, data=boys)

The histogram shows that the missingness in gen is not equally distributed across age; or, equivalently, age seems to be differently distributed for observed and missing gen.


14. Impute the boys dataset with mice using all default settings and name the mids (multiply imputed data set) object imp.

imp <- mice(boys, print=FALSE)

15. Compare the means of the imputed data with the means of the incomplete data. First, we calculate the observed data means:

boys %>% 
  select(-phb, -gen, -reg) %>%
  colMeans(na.rm = TRUE)
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 132.151786  37.153187  18.068556  51.505983  11.893805

and then the means for the \(m\) imputed sets:

imp %>%
  mice::complete("all") %>%
  map(select, -phb, -gen, -reg) %>%  
  map(colMeans)
## $`1`
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 131.051738  37.174894  18.049131  51.595588   8.425134 
## 
## $`2`
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 131.112834  37.122434  18.036029  51.636631   8.459893 
## 
## $`3`
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 131.016578  37.139400  18.056029  51.634759   8.359626 
## 
## $`4`
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 131.075802  37.152969  18.044211  51.647460   8.382353 
## 
## $`5`
##        age        hgt        wgt        bmi         hc         tv 
##   9.158866 131.050134  37.110523  18.034011  51.615909   8.171123

Most means are roughly equal, except the mean of tv, which is much lower in the imputed data sets, when compared to the incomplete data. This makes sense because most genital measures are unobserved for the lower ages. When imputing these values, the means should decrease.

Investigating univariate properties by using functions such as summary(), may not be ideal in the case of hundreds of variables. To extract just the information you need, for all imputed datasets, we can make use of the with() function. To obtain summaries for each imputed tv only, type

imp %>%
  with(summary(tv)) %>%
  summary()
## # A tibble: 5 × 6
##   minimum    q1 median  mean    q3 maximum
##     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1       1     2      3  8.43    15      25
## 2       1     2      3  8.46    15      25
## 3       1     2      3  8.36    15      25
## 4       1     2      4  8.38    15      25
## 5       1     2      3  8.17    15      25

And to obtain e.g. the means alone, run

imp %>%
  with(mean(tv)) %>%
  summary()
## # A tibble: 5 × 1
##       x
##   <dbl>
## 1  8.43
## 2  8.46
## 3  8.36
## 4  8.38
## 5  8.17

Repeated analysis in mice #2

16. Calculate a correlation between all continuous variables for the imputed boys data

There are two ways in which we can calculate the correlation on the imputed data:

  • The wrong way: calculate an estimate over the average imputed dataset .

Quite often people are suggesting that using the average imputed dataset - so taking the average over the imputed data set such that any realized cell depicts the average over the corresponding data in the imputed data - would be efficient and conform Rubin’s rules. This is not true. Doing this will yield false inference.

To demonstrate this, let’s ceate the averaged data set and exclude the non-numerical columns:

ave <- imp %>%
  mice::complete("long") %>%
  group_by(.id) %>%
  summarise_all(.funs = mean) %>%
  select(-.id, -.imp, -phb, -gen, -reg)

head(ave)
## # A tibble: 6 × 6
##     age   hgt   wgt   bmi    hc    tv
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.035  50.1  3.65  14.5  33.7   1.6
## 2 0.038  53.5  3.37  11.8  35     4.2
## 3 0.057  50    3.14  12.6  35.2   4.2
## 4 0.06   54.5  4.27  14.4  36.7   1.8
## 5 0.062  57.5  5.03  15.2  37.3   2  
## 6 0.068  55.5  4.66  15.1  37     1.6

If we now calculate Pearson’s correlation, rounded to two digits:

cor.wrong <- ave %>%
  cor() %>%
  round(digits = 2)

we obtain:

cor.wrong
##      age  hgt  wgt  bmi   hc   tv
## age 1.00 0.98 0.95 0.63 0.86 0.86
## hgt 0.98 1.00 0.94 0.60 0.91 0.80
## wgt 0.95 0.94 1.00 0.79 0.84 0.86
## bmi 0.63 0.60 0.79 1.00 0.59 0.64
## hc  0.86 0.91 0.84 0.59 1.00 0.67
## tv  0.86 0.80 0.86 0.64 0.67 1.00
  • The correct way: calculate an estimate for each imputed dataset and average over the estimates

It is best to do a Fisher transformation before pooling the correlation estimates - and a backtransformation afterwards. Therefore we define the following two functions that allow us to transform and backtransform any value:

fisher.trans <- function(x) 1/2 * log((1 + x) / (1 - x))
fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)

Now, to calculate the correlation on the imputed data

cor <- imp %>%
  mice::complete("all") %>%
  map(select, -phb, -gen, -reg) %>%  
  map(stats::cor) %>%
  map(fisher.trans)
cor
## $`1`
##           age       hgt      wgt       bmi        hc        tv
## age       Inf 2.1902519 1.832968 0.7359532 1.2736674 1.1679848
## hgt 2.1902519       Inf 1.768200 0.6829808 1.5154050 1.0323224
## wgt 1.8329681 1.7682004      Inf 1.0761200 1.2061600 1.1911138
## bmi 0.7359532 0.6829808 1.076120       Inf 0.6751337 0.6951250
## hc  1.2736674 1.5154050 1.206160 0.6751337       Inf 0.7586322
## tv  1.1679848 1.0323224 1.191114 0.6951250 0.7586322       Inf
## 
## $`2`
##           age       hgt      wgt       bmi       hc       tv
## age       Inf 2.1895521 1.836513 0.7373255 1.285681 1.189392
## hgt 2.1895521       Inf 1.765145 0.6829795 1.530631 1.044601
## wgt 1.8365129 1.7651450      Inf 1.0777002 1.207762 1.198626
## bmi 0.7373255 0.6829795 1.077700       Inf 0.668867 0.703993
## hc  1.2856808 1.5306312 1.207762 0.6688670      Inf 0.788366
## tv  1.1893921 1.0446015 1.198626 0.7039930 0.788366      Inf
## 
## $`3`
##           age       hgt      wgt       bmi        hc        tv
## age       Inf 2.1995858 1.839059 0.7367487 1.2879032 1.1841660
## hgt 2.1995858       Inf 1.773961 0.6851173 1.5301979 1.0411363
## wgt 1.8390591 1.7739610      Inf 1.0751337 1.2208654 1.2073990
## bmi 0.7367487 0.6851173 1.075134       Inf 0.6833696 0.7224565
## hc  1.2879032 1.5301979 1.220865 0.6833696       Inf 0.7951580
## tv  1.1841660 1.0411363 1.207399 0.7224565 0.7951580       Inf
## 
## $`4`
##           age       hgt      wgt       bmi        hc        tv
## age       Inf 2.1927145 1.838166 0.7382143 1.2777258 1.1563792
## hgt 2.1927145       Inf 1.767625 0.6852551 1.5266533 1.0202259
## wgt 1.8381659 1.7676252      Inf 1.0790701 1.2110611 1.2128070
## bmi 0.7382143 0.6852551 1.079070       Inf 0.6772456 0.7465530
## hc  1.2777258 1.5266533 1.211061 0.6772456       Inf 0.7681024
## tv  1.1563792 1.0202259 1.212807 0.7465530 0.7681024       Inf
## 
## $`5`
##           age       hgt      wgt       bmi        hc        tv
## age       Inf 2.1984184 1.835550 0.7378012 1.2818658 1.1549327
## hgt 2.1984184       Inf 1.769970 0.6861452 1.5190048 1.0007754
## wgt 1.8355497 1.7699697      Inf 1.0794425 1.2144690 1.1644951
## bmi 0.7378012 0.6861452 1.079442       Inf 0.6839130 0.7059595
## hc  1.2818658 1.5190048 1.214469 0.6839130       Inf 0.7511938
## tv  1.1549327 1.0007754 1.164495 0.7059595 0.7511938       Inf

The object cor is a list over the \(m\) imputations where each listed index is a correlation matrix. To calculate the average over the correlation matrices, we can add the \(m\) listed indices and divide them by \(m\):

cor.rect <- Reduce("+", cor) / length(cor) # m is equal to the length of the list
cor.rect <- fisher.backtrans(cor.rect)

If we compare the wrong estimates in cor.wrong

cor.wrong
##      age  hgt  wgt  bmi   hc   tv
## age 1.00 0.98 0.95 0.63 0.86 0.86
## hgt 0.98 1.00 0.94 0.60 0.91 0.80
## wgt 0.95 0.94 1.00 0.79 0.84 0.86
## bmi 0.63 0.60 0.79 1.00 0.59 0.64
## hc  0.86 0.91 0.84 0.59 1.00 0.67
## tv  0.86 0.80 0.86 0.64 0.67 1.00

with the correct estimates in cor.rect

round(cor.rect, digits = 2)
##      age  hgt  wgt  bmi   hc   tv
## age  NaN 0.98 0.95 0.63 0.86 0.82
## hgt 0.98  NaN 0.94 0.59 0.91 0.77
## wgt 0.95 0.94  NaN 0.79 0.84 0.83
## bmi 0.63 0.59 0.79  NaN 0.59 0.61
## hc  0.86 0.91 0.84 0.59  NaN 0.65
## tv  0.82 0.77 0.83 0.61 0.65  NaN

We see that the wrong estimates in cor.wrong have the tendency to overestimate the correlation coefficient that is correctly combined following Rubin’s rules.

The correct estimates have a diagonal of NaN’s, because the tranformation of a correlation of 1 yields Inf and the backtransformation of Inf has no representation in real number space. We know the diagonal is supposed to be 1, so we can simply correct this

diag(cor.rect) <- 1
cor.rect
##           age       hgt       wgt       bmi        hc        tv
## age 1.0000000 0.9754590 0.9504533 0.6274557 0.8568491 0.8244551
## hgt 0.9754590 1.0000000 0.9434976 0.5944342 0.9094576 0.7730294
## wgt 0.9504533 0.9434976 1.0000000 0.7922677 0.8372974 0.8320887
## bmi 0.6274557 0.5944342 0.7922677 1.0000000 0.5900259 0.6136888
## hc  0.8568491 0.9094576 0.8372974 0.5900259 1.0000000 0.6482594
## tv  0.8244551 0.7730294 0.8320887 0.6136888 0.6482594 1.0000000

Why does the average data set not serve as a good basis for analysis?

In FIMD v2, paragraph 5.1.2 Stef mentions the following:

The average workflow is faster and easier than the correct methods, since there is no need to replicate the analyses \(m\) times. In the words of Dempster and Rubin (1983), this workflow is

seductive because it can lull the user into the pleasurable state of believing that the data are complete after all.

The ensuing statistical analysis does not know which data are observed and which are missing, and treats all data values as real, which will underestimate the uncertainty of the parameters. The reported standard errors and p-values after data-averaging are generally too low. The correlations between the variables of the averaged data will be too high. For example, the correlation matrix in the average data are more extreme than the average of the \(m\) correlation matrices, which is an example of ecological fallacy. As researchers tend to like low p-values and high correlations, there is a cynical reward for the analysis of the average data. However, analysis of the average data cannot give a fair representation of the uncertainties associated with the underlying data, and hence is not recommended.


So, please stay away from averaging the imputed data sets. Instead, use the correct workflow of analyzing the imputed sets seperately and combining the inference afterwards.


The importance of the imputation model

The mammalsleep dataset is part of mice. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type

help(mammalsleep)

17. Get an overview of the data.

Find information about the size of the data, the variables measured and the amount of missingness.

head(mammalsleep)
##                     species       bw    brw sws  ps   ts  mls  gt pi sei odi
## 1          African elephant 6654.000 5712.0  NA  NA  3.3 38.6 645  3   5   3
## 2 African giant pouched rat    1.000    6.6 6.3 2.0  8.3  4.5  42  3   1   3
## 3                Arctic Fox    3.385   44.5  NA  NA 12.5 14.0  60  1   1   1
## 4    Arctic ground squirrel    0.920    5.7  NA  NA 16.5   NA  25  5   2   3
## 5            Asian elephant 2547.000 4603.0 2.1 1.8  3.9 69.0 624  3   5   4
## 6                    Baboon   10.550  179.5 9.1 0.7  9.8 27.0 180  4   4   4
summary(mammalsleep)
##                       species         bw                brw         
##  African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
##  African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
##  Arctic Fox               : 1   Median :   3.342   Median :  17.25  
##  Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
##  Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
##  Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
##  (Other)                  :56                                       
##       sws               ps              ts             mls         
##  Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
##  1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
##  Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
##  Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
##  3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
##  Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
##  NA's   :14       NA's   :12      NA's   :4       NA's   :4        
##        gt               pi             sei             odi       
##  Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
##  3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4
str(mammalsleep)
## 'data.frame':    62 obs. of  11 variables:
##  $ species: Factor w/ 62 levels "African elephant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ bw     : num  6654 1 3.38 0.92 2547 ...
##  $ brw    : num  5712 6.6 44.5 5.7 4603 ...
##  $ sws    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
##  $ ps     : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
##  $ ts     : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ mls    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
##  $ gt     : num  645 42 60 25 624 180 35 392 63 230 ...
##  $ pi     : int  3 3 1 5 3 4 1 4 1 1 ...
##  $ sei    : int  5 1 1 2 5 4 1 5 2 1 ...
##  $ odi    : int  3 3 1 3 4 4 1 4 1 1 ...

As we have seen before, the function md.pattern() can be used to display all different missing data patterns. How many different missing data patterns are present in the mammalsleep dataframe and which pattern occurs most frequently in the data?

md.pattern(mammalsleep)

##    species bw brw pi sei odi ts mls gt ps sws   
## 42       1  1   1  1   1   1  1   1  1  1   1  0
## 9        1  1   1  1   1   1  1   1  1  0   0  2
## 3        1  1   1  1   1   1  1   1  0  1   1  1
## 2        1  1   1  1   1   1  1   0  1  1   1  1
## 1        1  1   1  1   1   1  1   0  1  0   0  3
## 1        1  1   1  1   1   1  1   0  0  1   1  2
## 2        1  1   1  1   1   1  0   1  1  1   0  2
## 2        1  1   1  1   1   1  0   1  1  0   0  3
##          0  0   0  0   0   0  4   4  4 12  14 38

Answer: 8 patterns in total, with the pattern where everything is observed occuring the most (42 times).


18. Generate five imputed datasets with the default method pmm. Give the algorithm 10 iterations.

imp1 <- mice(mammalsleep, maxit = 10, print=F)
## Warning: Number of logged events: 525

We ignore the loggedEvents for now: it contains a list of all decisions and exclusions that are performed by the mice algorithm. To inspect the trace lines for assessing algorithmic convergence:

plot(imp1)


19. Perform a regression analysis on the imputed dataset with sws as dependent variable and log10(bw) and odi as independent variables.

fit1 <- with(imp1, lm(sws ~ log10(bw) + odi))

20. Pool the regression analysis and inspect the pooled analysis.

est1 <- pool(fit1)
est1
## Class: mipo    m = 5 
##          term m   estimate       ubar           b          t dfcom        df
## 1 (Intercept) 5 10.6803524 0.70251366 0.910812327 1.79548846    59  7.277917
## 2   log10(bw) 5 -1.1192328 0.09917346 0.405493877 0.58576612    59  3.623712
## 3         odi 5 -0.6006961 0.08783420 0.007314457 0.09661155    59 46.887060
##          riv     lambda       fmi
## 1 1.55580574 0.60873396 0.6848712
## 2 4.90648031 0.83069443 0.8818155
## 3 0.09993088 0.09085197 0.1273002
summary(est1)
##          term   estimate std.error statistic        df      p.value
## 1 (Intercept) 10.6803524 1.3399584  7.970660  7.277917 0.0000756456
## 2   log10(bw) -1.1192328 0.7653536 -1.462373  3.623712 0.2245450701
## 3         odi -0.6006961 0.3108240 -1.932593 46.887060 0.0593397704

The fmi and lambda are much too high. This is due to species being included in the imputation model. Because there are 62 species and mice automatically converts factors (categorical variables) to dummy variables, each species is modeled by its own imputation model.


21. Impute mammalsleep again, but now exclude species from the data.

imp2 <- mice(mammalsleep[ , -1], maxit = 10, print = F)
## Warning: Number of logged events: 23

22. Compute and pool the regression analysis again.

fit2 <- with(imp2, lm(sws ~ log10(bw) + odi))
est2 <- pool(fit2)
est2
## Class: mipo    m = 5 
##          term m   estimate       ubar           b          t dfcom       df
## 1 (Intercept) 5 11.6112094 0.55887869 0.026313179 0.59045450    59 52.03293
## 2   log10(bw) 5 -1.1500152 0.07889659 0.005715367 0.08575504    59 48.45970
## 3         odi 5 -0.9129462 0.06987574 0.004631005 0.07543294    59 49.34892
##          riv     lambda        fmi
## 1 0.05649851 0.05347713 0.08787555
## 2 0.08692949 0.07997712 0.11573414
## 3 0.07952984 0.07367081 0.10906138
summary(est2)
##          term   estimate std.error statistic       df     p.value
## 1 (Intercept) 11.6112094 0.7684104 15.110688 52.03293 0.000000000
## 2   log10(bw) -1.1500152 0.2928396 -3.927116 48.45970 0.000271801
## 3         odi -0.9129462 0.2746506 -3.324028 49.34892 0.001678383

Note that the fmi and lambda have dramatically decreased. The imputation model has been greatly improved.


23. Plot the trace lines for the new imputations

plot(imp2)

Even though the fraction of information missing due to nonresponse (fmi) and the relative increase in variance due to nonresponse (lambda) are nice and low, the convergence turns out to be a real problem. The reason is the structure in the data. Total sleep (ts) is the sum of paradoxical sleep (ps) and short wave sleep (sws). This relation is ignored in the imputations, but it is necessary to take this relation into account. mice offers a routine called passive imputation, which allows users to take transformations, combinations and recoded variables into account when imputing their data.


Passive Imputation

There is often a need for transformed, combined or recoded versions of the data. In the case of incomplete data, one could impute the original, and transform the completed original afterwards, or transform the incomplete original and impute the transformed version. If, however, both the original and the transformed version are needed within the imputation algorithm, neither of these approaches work: One cannot be sure that the transformation holds between the imputed values of the original and transformed versions. mice has a built-in approach, called passive imputation, to deal with situations as described above. The goal of passive imputation is to maintain the consistency among different transformations of the same data. As an example, consider the following deterministic function in the boys data \[\text{BMI} = \frac{\text{Weight (kg)}}{\text{Height}^2 \text{(m)}}\] or the compositional relation in the mammalsleep data: \[\text{ts} = \text{ps}+\text{sws}\]


24. Use passive imputation to impute the deterministic sleep relation in the mammalsleep data. Name the new multiply imputed dataset pas.imp.

First, we create a method vector:

meth <- make.method(mammalsleep)
meth
## species      bw     brw     sws      ps      ts     mls      gt      pi     sei 
##      ""      ""      ""   "pmm"   "pmm"   "pmm"   "pmm"   "pmm"      ""      "" 
##     odi 
##      ""

and a predictorMatrix:

pred <- make.predictorMatrix(mammalsleep)
pred
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            1  0   1   1  1  1   1  1  1   1   1
## brw           1  1   0   1  1  1   1  1  1   1   1
## sws           1  1   1   0  1  1   1  1  1   1   1
## ps            1  1   1   1  0  1   1  1  1   1   1
## ts            1  1   1   1  1  0   1  1  1   1   1
## mls           1  1   1   1  1  1   0  1  1   1   1
## gt            1  1   1   1  1  1   1  0  1   1   1
## pi            1  1   1   1  1  1   1  1  0   1   1
## sei           1  1   1   1  1  1   1  1  1   0   1
## odi           1  1   1   1  1  1   1  1  1   1   0

We add the call for passive imputation to the ts element in the meth object

meth["ts"]<- "~ I(sws + ps)"
meth
##         species              bw             brw             sws              ps 
##              ""              ""              ""           "pmm"           "pmm" 
##              ts             mls              gt              pi             sei 
## "~ I(sws + ps)"           "pmm"           "pmm"              ""              "" 
##             odi 
##              ""

and set the predictor relations for ts with sws and ps to 0. Also, we have to exclude Species as a predictor

pred[c("sws", "ps"), "ts"] <- 0
pred[, "species"] <- 0
pred
##         species bw brw sws ps ts mls gt pi sei odi
## species       0  1   1   1  1  1   1  1  1   1   1
## bw            0  0   1   1  1  1   1  1  1   1   1
## brw           0  1   0   1  1  1   1  1  1   1   1
## sws           0  1   1   0  1  0   1  1  1   1   1
## ps            0  1   1   1  0  0   1  1  1   1   1
## ts            0  1   1   1  1  0   1  1  1   1   1
## mls           0  1   1   1  1  1   0  1  1   1   1
## gt            0  1   1   1  1  1   1  0  1   1   1
## pi            0  1   1   1  1  1   1  1  0   1   1
## sei           0  1   1   1  1  1   1  1  1   0   1
## odi           0  1   1   1  1  1   1  1  1   1   0

This avoids circularity problems where ts would feed back into sws and ps, from which it is calculated:

We can then run the imputations as

pas.imp <- mice(mammalsleep, 
                meth = meth, 
                pred = pred, 
                maxit = 50, 
                seed = 123, 
                print = F)

We used a custom predictor matrix and method vector to tailor our imputation approach to the passive imputation problem. We made sure to exclude ts as a predictor for the imputation of sws and ps to avoid circularity.

We also gave the imputation algorithm 10 iterations to converge and fixed the seed to 123 for this mice instance. This means that even when people do not fix the overall R seed for a session, exact replication of results can be obtained by simply fixing the seed for the random number generator within mice. Naturally, the same input (data) is each time required to yield the same output (mids-object).

When we study convergence, we see that the apparent non-convergence that we saw before has now disappeared with the use of passive imputation for the deterministic system (ts, ps, sws).

plot(pas.imp)


For fun

What you shouldn’t do with passive imputation!

Never set all relations fixed. You will remain with the starting values and waste your computer’s energy (and your own).

meth<- make.method(boys)
pred <- make.predictorMatrix(boys)
meth["bmi"]<- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"]<- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"]<- "~ I(sqrt(wgt / bmi) * 100)"
pred[c("bmi", "wgt", "hgt"), ] <- 0
imp.path <- mice(boys, 
                 meth=meth, 
                 pred=pred, 
                 seed=123)
## 
##  iter imp variable
##   1   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   1   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   1   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   1   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   1   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   2   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   2   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   2   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   2   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   2   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   3   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   3   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   3   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   3   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   3   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   4   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   4   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   4   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   4   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   4   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   5   1  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   5   2  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   5   3  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   5   4  hgt  wgt  bmi  hc  gen  phb  tv  reg
##   5   5  hgt  wgt  bmi  hc  gen  phb  tv  reg
plot(imp.path, c("hgt", "wgt", "bmi"))

We named the mids object imp.path, because the nonconvergence is pathological in this example!


Conclusion

We have seen that the practical execution of multiple imputation and pooling is straightforward with the R package mice. The package is designed to allow you to assess and control the imputations themselves, the convergence of the algorithm and the distributions and multivariate relations of the observed and imputed data.

It is important to ‘gain’ this control as a user. After all, we are imputing values and we aim to properly adress the uncertainty about the missingness problem.


Additional materials

A more detailed practical guide to mice in R can be found here


References

Rubin, D. B. Multiple imputation for nonresponse in surveys. John Wiley & Sons, 1987. Amazon

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall. Table 6.14. Amazon

Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. pdf


- End of practical


sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] purrr_0.3.4     magrittr_2.0.2  dplyr_1.0.8     lattice_0.20-45
## [5] mice_3.14.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3     highr_0.9        bslib_0.3.1      compiler_4.1.3  
##  [5] pillar_1.7.0     jquerylib_0.1.4  tools_4.1.3      digest_0.6.29   
##  [9] jsonlite_1.8.0   evaluate_0.15    lifecycle_1.0.1  tibble_3.1.6    
## [13] pkgconfig_2.0.3  rlang_1.0.2      cli_3.2.0        DBI_1.1.2       
## [17] rstudioapi_0.13  yaml_2.3.5       xfun_0.30        fastmap_1.1.0   
## [21] withr_2.5.0      stringr_1.4.0    knitr_1.37       generics_0.1.2  
## [25] vctrs_0.3.8      sass_0.4.0       nnet_7.3-17      grid_4.1.3      
## [29] tidyselect_1.1.2 glue_1.6.2       R6_2.5.1         fansi_1.0.2     
## [33] rmarkdown_2.13   tidyr_1.2.0      MASS_7.3-55      backports_1.4.1 
## [37] htmltools_0.5.2  ellipsis_0.3.2   assertthat_0.2.1 utf8_1.2.2      
## [41] stringi_1.7.6    broom_0.7.12     crayon_1.5.0